Anatomy of machine learning algorithm implementations in MPI, Spark, and Flink

نویسندگان

Supun Kamburugamuve

Pulasthi Wickramasinghe

Saliya Ekanayake

Geoffrey C. Fox

چکیده

With the ever-increasing need to analyze large amounts of data to get useful insights, it is essential to develop complex parallel machine learning algorithms that can scale with data and number of parallel processes. These algorithms need to run on large data sets as well as they need to be executed with minimal time in order to extract useful information in a time constrained environment. MPI is a widely used model for developing such algorithms in high-performance computing paradigm while Apache Spark and Apache Flink are emerging as big data platforms for large-scale parallel machine learning. Even though these big data frameworks are designed differently, they follow the data flow model for execution and user APIs. Data flow model offers fundamentally different capabilities than the MPI execution model, but the same type of parallelism can be used in applications developed in both models. This paper presents three distinct machine learning algorithms implemented in MPI, Spark, and Flink and compares their performance and identifies strengths and weaknesses in each platform. Keywords—Machine Learning, Big data, HPC, MDS, MPI, Spark, Flink, K-Means, Terasort

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A comparison on scalability for batch big data processing on Apache Spark and Apache Flink

*Correspondence: [email protected] 1Department of Computer Science and Artificial Intelligence, CITIC-UGR (Research Center on Information and Communications Technology), University of Granada, Calle Periodista Daniel Saucedo Aranda, 18071 Granada, Spain Full list of author information is available at the end of the article Abstract The large amounts of data have created a need for new fram...

متن کامل

Avoiding communication in primal and dual block coordinate descent methods

Primal and dual block coordinate descent methods are iterative methods for solving regularized and unregularized optimization problems. Distributed-memory parallel implementations of these methods have become popular in analyzing large machine learning datasets. However, existing implementations communicate at every iteration which, on modern data center and supercomputing architectures, often ...

متن کامل

Machine learning in ScalOps, a higher order cloud computing language

Machine learning practitioners are increasingly interested in applying their algorithms to Big Data. Unfortunately, current high-level languages for data analytics in the cloud (e.g., [2, 15, 16, 5]) do not fully cover this domain. One key missing ingredient is means to express iteration over the data (e.g., [19]). Zaharia et al., were the first to answer this call from a systems perspective wi...

متن کامل

Bridging the Gap between HPC and Big Data frameworks

Apache Spark is a popular framework for data analytics with attractive features such as fault tolerance and interoperability with the Hadoop ecosystem. Unfortunately, many analytics operations in Spark are an order of magnitude or more slower compared to native implementations written with high performance computing tools such as MPI. There is a need to bridge the performance gap while retainin...

متن کامل

AMIDST: a Java Toolbox for Scalable Probabilistic Machine Learning

The AMIDST Toolbox is a software for scalable probabilistic machine learning with a special focus on (massive) streaming data. The toolbox supports a flexible modeling language based on probabilistic graphical models with latent variables and temporal dependencies. The specified models can be learnt from large data sets using parallel or distributed implementations of Bayesian learning algorith...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

IJHPCA

دوره 32 شماره

صفحات -

تاریخ انتشار 2018

Anatomy of machine learning algorithm implementations in MPI, Spark, and Flink

نویسندگان

چکیده

منابع مشابه

A comparison on scalability for batch big data processing on Apache Spark and Apache Flink

Avoiding communication in primal and dual block coordinate descent methods

Machine learning in ScalOps, a higher order cloud computing language

Bridging the Gap between HPC and Big Data frameworks

AMIDST: a Java Toolbox for Scalable Probabilistic Machine Learning

عنوان ژورنال:

اشتراک گذاری